You’ve probably heard the idea that data analysis is 80 percent cleaning data or something like this. I think this is generally true, but “cleaning” obscures the wide variation of data manipulation that is required to go from a dataset found in the wild to a dataset that is ready to analyze.
One common issue that you’ll find in datasets that you want to work with is that they aren’t the right “shape” for the analysis you want to perform. In other words, you might be trying to calculate a new variable or make a chart, but the rows and columns in your dataset aren’t quite right. In R and the tidyverse, the shape of the data that is most commonly used is called “tidy data.”
Take for example the following two datasets of California, Oregon, and Washington population between 1990 and 2020:
df_a
## # A tibble: 12 × 3
## state year pop
## <chr> <dbl> <dbl>
## 1 CA 1990 29950111
## 2 CA 2000 33987977
## 3 CA 2010 37319550
## 4 CA 2020 39368078
## 5 OR 1990 2858547
## 6 OR 2000 3429708
## 7 OR 2010 3837614
## 8 OR 2020 4241507
## 9 WA 1990 4900780
## 10 WA 2000 5910512
## 11 WA 2010 6743009
## 12 WA 2020 7693612
df_b
## # A tibble: 4 × 4
## year CA OR WA
## <dbl> <dbl> <dbl> <dbl>
## 1 1990 29950111 2858547 4900780
## 2 2000 33987977 3429708 5910512
## 3 2010 37319550 3837614 6743009
## 4 2020 39368078 4241507 7693612
Both of these datasets contain the exact same data, but they are stored in a different format. The concept of tidy data describes why one of these datasets is easier to work with and one is harder.
Every column is a variable.
Every row is an observation.
Every cell is a single value.
In the example above, df_a is tidy, while
df_b is messy. Why?
In df_a we have three variables, state, year, and
population, and each of these variables is in it’s own column. In
df_a, each row represents an observation for one state in
one year. And in df_a, each there is only one value in each
cell of the table.
On the other hand, in df_b, each column is a state, but
in this case, states are actually values, not variables, and belong in
their own column. Additionally, there is no single column that
represents the variable we are most interested in, population. In
df_b, each row represents a year, but there are multiple
observations in each row for each year – one per state. The third
requirement of tidy data is not broken by df_b.
{ggplot2} and many other functions from the tidyverse work best when
data is in a tidy format. Let’s say we want to make line chart of
population over time by state using the example data above. In most line
charts we would have year on the x-axis, population on the y-axis, and
one line per state. If you remember last week’s lesson on {ggplot2}, in
order to do this in {ggplot2} we’ll need to assign these variables to
aesthetics. In this case, there are three aesthetics we
want map with our data: x, y, and
color.
Using df_a, this mapping of columns to aesthetics is
straightforward because we have a column for year, a column for
population, and a column for state:
df_a |>
ggplot(aes(x = year, y = pop, color = state)) +
geom_line()
But how would we make the same plot using df_b? We can’t
because population is spread across three columns, and the states are
column names, not values in our data.
df_b |>
ggplot(aes(x = year, y = ???, color = ???)) +
geom_line()
So, now that you’ve seen an example in which we need to tidy our data
in order to plot it using {ggplot2}, how do we go from messy to tidy?
Two key functions from the {tidyr} package to transform messy data to
tidy data are pivot_longer() and
pivot_wider().
If we want to go from df_b to df_a, we’re
going to need to “lengthen” our data. This is a common issue when data
is stored in column names. As described above, the data that we want to
be values in our table are the state abbreviations, which are currently
the column names.
df_b
## # A tibble: 4 × 4
## year CA OR WA
## <dbl> <dbl> <dbl> <dbl>
## 1 1990 29950111 2858547 4900780
## 2 2000 33987977 3429708 5910512
## 3 2010 37319550 3837614 6743009
## 4 2020 39368078 4241507 7693612
df_b_tidy <- df_b |>
pivot_longer(
cols = c(CA, OR, WA),
names_to = "state",
values_to = "pop"
)
df_b_tidy
## # A tibble: 12 × 3
## year state pop
## <dbl> <chr> <dbl>
## 1 1990 CA 29950111
## 2 1990 OR 2858547
## 3 1990 WA 4900780
## 4 2000 CA 33987977
## 5 2000 OR 3429708
## 6 2000 WA 5910512
## 7 2010 CA 37319550
## 8 2010 OR 3837614
## 9 2010 WA 6743009
## 10 2020 CA 39368078
## 11 2020 OR 4241507
## 12 2020 WA 7693612
The key arguments to pivot_longer() are:
cols – select which columns to pivot into valuesnames_to – name of new column names are stored invalues_to – name of new column values are stored
inIn order to tidy this data, we’ve created a new column
state which contains state abbreviations that were
previously stored as column names. We’ve created second new column
pop that now contains the population of each state in each
year. And now we can use the same ggplot2 code from above and make our
line chart.
df_b_tidy |>
ggplot(aes(x = year, y = pop, color = state)) +
geom_line()
Transforming data from wide to long, is probably the more common operation in data cleaning, but there are instances where you’ll want the data to be wide. This might be to make a table for presentation, or to run statistical models, or sometimes to create a tidy data set you need first lengthen and then widen data.
As a straightforward example to show how pivot_wider()
works, we’ll take our newly tidied data, df_b_tidy and turn
it make into the messy version we started with.
df_b_tidy |>
pivot_wider(
names_from = "state",
values_from = "pop"
)
## # A tibble: 4 × 4
## year CA OR WA
## <dbl> <dbl> <dbl> <dbl>
## 1 1990 29950111 2858547 4900780
## 2 2000 33987977 3429708 5910512
## 3 2010 37319550 3837614 6743009
## 4 2020 39368078 4241507 7693612
The key arguments to pivot_wider() are similar to
pivot_longer(), but in this case we need to specify in
which column the new column names should come from, and which column
contains the data or values we want to put in our wider table.
If you’ve taken a statistics or research methods class, you might have encountered the concept of data types. The idea is that data can be stored in different ways and depending on the type of data, you can conduct different kinds of analyses. The most general data types are quantitative and qualitative. At a high-level, quantitative data are things that can be measured and represented by numbers (e.g. state population) and qualitative data are things that can be measured and represented by text (e.g. state flower). Within both quantitative and qualitative data, there are more specific types and these data types have equivalents when working with data in programming languages. We’re not going to talk about all the data types and when and why you need them in this lesson. The basic concept to understand is that data types represent things in the real world that we measure and store and we should choose the most appropriate computational data type for the real world thing that we are measuring.
There are two more specific data types of qualitative data that are relevant to this lesson. Categorical (also called nominal) data is quantitative data that has no inherent order or ranking in it. For example, the possible values of eye color in a DOC data system may be: blue, green, brown, gray, hazel. But there is no way to rank these eye colors from highest to lowest or best to worst. Ordinal data, on the other hand, is data that has an clear ordering of the options. For example, the possible values of a risk assessment in a DOC data system may be: low, medium, high. In this case, it is possible to order these values from least to most risk.
Using appropriate data types is essential for a number of statistical processes, but in this lesson, we care about what data types can do for us when plotting data using {ggplot2}. The biggest reason we care about data types, and specifically categorical and ordinal data is that they dictate the order of groups and categories in plots.
Let’s say we have data on average sentence length of people by assessed risk level and we’d like to plot this.
risk_sentence
## # A tibble: 4 × 2
## risk_level mean_sentence_length
## <chr> <dbl>
## 1 Low 12
## 2 Medium 18
## 3 High 30
## 4 Very high 60
An appropriate way to visualize this data is a bar chart, where risk level is on the x-axis and sentence length is on the y-axis. Let’s check it out!
risk_sentence |>
ggplot(aes(x = risk_level, y = mean_sentence_length)) +
geom_col() + # geom_col is equivalent to geom_bar(stat = "identity") from the last lesson
labs(
title = "Average sentence length by risk level",
x = "Risk level",
y = "Mean sentence length (months)"
)
There is nothing incorrect about this plot, but it is confusing to interpret because the risk levels are not in a sensible order (by default they are ordered alphabetically). A better plot would have the risk levels on the x-axis in order from lowest to highest.
To achieve this, we’ll have to use what’s called in R, a
factor variable. This is a special kind of qualitative
data that has a set number of possible categories and an in this case a
defined order. There is a package that is part of the tidyverse called
{forcats} that has a lot of useful functions to make it easier to work
with factors, so we’re going to make use of that. If you’ve already
attached the tidyverse, using library(tidyverse), {factors}
will be ready to use.
Let’s use mutate() from {dplyr} to modify the
risk_level column and change it into an ordered factor
(from low to very high). Within mutate(), we use
fct_relevel() to define the levels and orders of the new
factor we create.
risk_sentence_ordered <- risk_sentence |>
mutate(risk_level = fct_relevel(risk_level, "Low", "Medium", "High", "Very high"))
risk_sentence_ordered$risk_level
## [1] Low Medium High Very high
## Levels: Low Medium High Very high
Now, if we make this same plot as above it our x-axis is ordered how we want it!
risk_sentence_ordered |>
ggplot(aes(x = risk_level, y = mean_sentence_length)) +
geom_col() +
labs(
title = "Average sentence length by risk level",
x = "Risk level",
y = "Mean sentence length (months)"
)
Note that we use fct_relevel() to manually define the
order of the values or levels in the factor variable that we want to
create. In this next example, we’ll use a different function from
{forcats} to create a factor variable that is ordered by the value of
another column.
race_sentence <- tribble(
~race, ~mean_sentence_length,
"Black", 12,
"Asian", 18,
"White", 30,
"Latino", 60
)
Here, we have a similar dataset as above, but this time we have
average sentence length by the race of the person sentenced. We have
very similar ggplot2 code, except we change the dataset we are using and
change the variable mapped to the x-axis aesthetic from
risk_level to race.
race_sentence |>
ggplot(aes(x = race, y = mean_sentence_length)) +
geom_col() +
labs(
title = "Average sentence length by race",
x = "Race",
y = "Mean sentence length (months)"
)
Again, there is nothing “wrong” about this plot, but as with the risk
level plot, the x-axis is ordered alphabetically by default. To make
this easier to interpret, we might want to order the x-axis by average
sentence length so it’s easier for some to quickly see a pattern. We’re
going to use fct_reorder() to accomplish this.
race_sentence_ordered <- race_sentence |>
mutate(race = fct_reorder(race, mean_sentence_length))
race_sentence_ordered$race
## [1] Black Asian White Latino
## Levels: Black Asian White Latino
The key arguments to fct_reorder() are .f
and .x where .f (the first argument) is a
character or factor column you want to modify and .x is
column you want to use to sort or order by. In this case, we are saying
that we want to modify race to become an ordered factor
variable that is sorted by mean_sentence_length.
So looking at the plot again, we get:
race_sentence_ordered |>
ggplot(aes(x = race, y = mean_sentence_length)) +
geom_col() +
labs(
title = "Average sentence length by race",
x = "Race",
y = "Mean sentence length (months)"
)
Now instead of the x-axis being in alphabetical order, it is ordered
by sentence length (low to high). Depending on what we are trying to
convey in this plot, we might want to reverse the order so that x-axis
is ordered from high to low. To do this, set the .desc
argument in fct_reorder() to TRUE so that
race is now ordered in descending order based on
mean_sentence_length.
race_sentence_ordered |>
mutate(race = fct_reorder(race, mean_sentence_length, .desc = TRUE)) |>
ggplot(aes(x = race, y = mean_sentence_length)) +
geom_col() +
labs(
title = "Average sentence length by race",
x = "Race",
y = "Mean sentence length (months)"
)
Another common use case for factors in ggplot2 is adjusting the order of legend items. Returning to our plot from the tidy data section of state population, notice that by default, the legend is ordered alphabetically (CA, OR, WA).
df_a |>
ggplot(aes(x = year, y = pop, color = state)) +
geom_line()
This is another case, where this plot is not incorrect, but could be improved if we adjust the items in the legend so that they are in order of the lines in our plot. Factors to the rescue!
We’ll use the same concept as ordering the race column above, but
this time reorder the state column to be sorted (in
descending order) by pop.
df_a |>
mutate(state = fct_reorder(state, pop, .desc = TRUE)) |>
ggplot(aes(x = year, y = pop, color = state)) +
geom_line()
Subtle difference, but now this plot is easier to read because the legend items are in the same order as the lines.
{forcats} has many other functions that help you create and modify
factors, but, but for the purposes of plotting,
fct_relevel() and fct_reorder() are two of the
most useful!
We covered the basic transformations needed to convert data from wide to long (and back again), but the data you find in the wild will likely be even messier than the example data we used above. For this next section we’ll use a dataset published by the Vera Institute of jail population by county. You can read in the CSV file from Vera’s GitHub like this:
url <- "https://github.com/vera-institute/incarceration-trends/raw/master/incarceration_trends.csv"
vera_incarceration <- read_csv(url)
This is quite a dataset, so for now we’ll work with one state (Arizona) and a subset of columns in the data set.
az_jail_county <- vera_incarceration |>
filter(state == "AZ") |>
select(
year,
state,
county_name,
urbanicity,
total_pop_15to64,
female_pop_15to64,
male_pop_15to64,
total_jail_pop,
male_jail_pop,
female_jail_pop
)
When I’m working with a new dataset, I almost always start with some basic exploration to see what we’ve got.
az_jail_county |>
glimpse()
## Rows: 735
## Columns: 10
## $ year <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978…
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ"…
## $ county_name <chr> "Apache County", "Apache County", "Apache County", "…
## $ urbanicity <chr> "rural", "rural", "rural", "rural", "rural", "rural"…
## $ total_pop_15to64 <dbl> 17173, 18752, 19818, 21418, 22079, 23811, 24868, 275…
## $ female_pop_15to64 <dbl> 8886, 9698, 10248, 11086, 11439, 12357, 12909, 14299…
## $ male_pop_15to64 <dbl> 8287, 9054, 9570, 10332, 10640, 11454, 11959, 13250,…
## $ total_jail_pop <dbl> 7.00, 7.50, 8.00, 8.50, 9.00, 9.50, 10.00, 10.50, 11…
## $ male_jail_pop <dbl> 4.00, 4.75, 5.50, 6.25, 7.00, 7.75, 8.50, 9.25, 10.0…
## $ female_jail_pop <dbl> 0.00, 0.12, 0.25, 0.38, 0.50, 0.62, 0.75, 0.88, 1.00…
What years does this data cover?
range(az_jail_county$year)
## [1] 1970 2018
What counties are included and how many rows per county?
az_jail_county |>
count(county_name)
## # A tibble: 15 × 2
## county_name n
## <chr> <int>
## 1 Apache County 49
## 2 Cochise County 49
## 3 Coconino County 49
## 4 Gila County 49
## 5 Graham County 49
## 6 Greenlee County 49
## 7 La Paz County 49
## 8 Maricopa County 49
## 9 Mohave County 49
## 10 Navajo County 49
## 11 Pima County 49
## 12 Pinal County 49
## 13 Santa Cruz County 49
## 14 Yavapai County 49
## 15 Yuma County 49
What is urbanicity?!
az_jail_county |>
count(urbanicity)
## # A tibble: 4 × 2
## urbanicity n
## <chr> <int>
## 1 rural 343
## 2 small/mid 294
## 3 suburban 49
## 4 urban 49
Cool! We have a basic handle on this data and one of the first things I noticed is that we’ve got un-tidy data. We have data for total population, and jail population, but this data is also broken down by male and female. And each of these sub-populations has it’s own column. If for example we wanted to make a line chart of jail population over time by sex, we couldn’t do it with data in this format.
az_jail_county |>
ggplot(aes(x = year, y = ???, color = ???))
So, as before, we’ve got to tidy this data up! As you can see this data set is a little more complicated than our example dataset above, so we’ll need to do a few more transformations before we can this data into a tidy format we can plot and analyze.
The first step is to pivot_longer() so that all our
values become a single population column that we’ll call
n.
az_jail_county |>
pivot_longer(
cols = -c(year, state, county_name, urbanicity),
names_to = "var",
values_to = "n"
)
## # A tibble: 4,410 × 6
## year state county_name urbanicity var n
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 1970 AZ Apache County rural total_pop_15to64 17173
## 2 1970 AZ Apache County rural female_pop_15to64 8886
## 3 1970 AZ Apache County rural male_pop_15to64 8287
## 4 1970 AZ Apache County rural total_jail_pop 7
## 5 1970 AZ Apache County rural male_jail_pop 4
## 6 1970 AZ Apache County rural female_jail_pop 0
## 7 1971 AZ Apache County rural total_pop_15to64 18752
## 8 1971 AZ Apache County rural female_pop_15to64 9698
## 9 1971 AZ Apache County rural male_pop_15to64 9054
## 10 1971 AZ Apache County rural total_jail_pop 7.5
## # ℹ 4,400 more rows
Notice that instead of specifying which columns I want to pivot in
the cols argument, in this case I specified which columns I
did not want to pivot using -c() (i.e. all columns
except the ones I named).
This is a good first step, but that the values of our new
var column show that we have data for total population,
female population, and male population as well as data for total jail
population, female jail population, and male jail population. This is
actually two separate variables or concepts in one column: demographic
group (total, male, female) and population type (total county and jail).
We can use another function from {tidyr} to split this column into it’s
two parts, separate_wider_delim(). With this function we
can split a single column into two (or more) new columns based on a
delimiter we specify that identifies how to split the data.
az_jail_county |>
pivot_longer(
cols = -c(year, state, county_name, urbanicity),
names_to = "var",
values_to = "n"
) |>
separate_wider_delim(
cols = var,
delim = "_",
names = c("dem_group", "pop_type"),
too_many = "merge"
)
## # A tibble: 4,410 × 7
## year state county_name urbanicity dem_group pop_type n
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1970 AZ Apache County rural total pop_15to64 17173
## 2 1970 AZ Apache County rural female pop_15to64 8886
## 3 1970 AZ Apache County rural male pop_15to64 8287
## 4 1970 AZ Apache County rural total jail_pop 7
## 5 1970 AZ Apache County rural male jail_pop 4
## 6 1970 AZ Apache County rural female jail_pop 0
## 7 1971 AZ Apache County rural total pop_15to64 18752
## 8 1971 AZ Apache County rural female pop_15to64 9698
## 9 1971 AZ Apache County rural male pop_15to64 9054
## 10 1971 AZ Apache County rural total jail_pop 7.5
## # ℹ 4,400 more rows
The key arguments here are
cols: which column to splitdelim: what character delimiter to use to split onnames: names of new columns to createtoo_many: what to do if there is “extra” stuff in a
split columnAnd so you can see that we now have two columns called
dem_group and pop_type in place of the
var column. While we’re at it, let’s also clean up the
pop_type column so it’s a little more clear what this
column represents.
az_jail_county |>
pivot_longer(
cols = -c(year, state, county_name, urbanicity),
names_to = "var",
values_to = "n"
) |>
separate_wider_delim(
cols = var,
delim = "_",
names = c("dem_group", "pop_type"),
too_many = "merge"
) |>
mutate(
pop_type = if_else(pop_type == "pop_15to64", "pop_total", "pop_jail")
)
## # A tibble: 4,410 × 7
## year state county_name urbanicity dem_group pop_type n
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1970 AZ Apache County rural total pop_total 17173
## 2 1970 AZ Apache County rural female pop_total 8886
## 3 1970 AZ Apache County rural male pop_total 8287
## 4 1970 AZ Apache County rural total pop_jail 7
## 5 1970 AZ Apache County rural male pop_jail 4
## 6 1970 AZ Apache County rural female pop_jail 0
## 7 1971 AZ Apache County rural total pop_total 18752
## 8 1971 AZ Apache County rural female pop_total 9698
## 9 1971 AZ Apache County rural male pop_total 9054
## 10 1971 AZ Apache County rural total pop_jail 7.5
## # ℹ 4,400 more rows
And finally, the last step here to make a tidy data set is to pivot once again so that for each row we have one column that is the total population and one column that is jail population. While we’re at it, we might as well create one more new variable since we have jail population and total population: incarceration rate. Since we have this nice tidy dataset, that will just need some simple division.
az_jail_county_tidy <- az_jail_county |>
pivot_longer(
cols = -c(year, state, county_name, urbanicity),
names_to = "var",
values_to = "n"
) |>
separate_wider_delim(
cols = var,
delim = "_",
names = c("dem_group", "pop_type"),
too_many = "merge"
) |>
mutate(
pop_type = if_else(pop_type == "pop_15to64", "pop_total", "pop_jail")
) |>
pivot_wider(
names_from = "pop_type",
values_from = "n"
) |>
mutate(rate = pop_jail / pop_total)
az_jail_county_tidy
## # A tibble: 2,205 × 8
## year state county_name urbanicity dem_group pop_total pop_jail rate
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1970 AZ Apache County rural total 17173 7 0.000408
## 2 1970 AZ Apache County rural female 8886 0 0
## 3 1970 AZ Apache County rural male 8287 4 0.000483
## 4 1971 AZ Apache County rural total 18752 7.5 0.000400
## 5 1971 AZ Apache County rural female 9698 0.12 0.0000124
## 6 1971 AZ Apache County rural male 9054 4.75 0.000525
## 7 1972 AZ Apache County rural total 19818 8 0.000404
## 8 1972 AZ Apache County rural female 10248 0.25 0.0000244
## 9 1972 AZ Apache County rural male 9570 5.5 0.000575
## 10 1973 AZ Apache County rural total 21418 8.5 0.000397
## # ℹ 2,195 more rows
And here we have it! A tidy data set that we lengthened, separated, and widened which we’re going to use to explore some additional features of {ggplot2}.
Last lesson introduced making static plots in R using {ggplot2}. Today, we’re going to expand on that introduction and demonstrate some additional features of {ggplot2}. We are still just scratching the surface of what is possible using {ggplot2}, but hopefully these concepts will get you excited to explore more features and make beautiful plots.
Let’s start by applying some of the things we learned last lesson and earlier in this lesson to a plot of jail incarceration rate by county from our tidy Arizona dataset. We’ll start with a line plot of all the counties and see what we can see.
az_jail_county_tidy |>
filter(dem_group == "total") |>
ggplot(aes(year, rate, color = county_name)) +
geom_line()
The first thing that jumps out at me is that something odd happened in one of the counties starting in 1999 that appears to be a data error. It’s a hard to tell which county this is because we have too many lines and colors in our plot. But this is a great example of why plotting your data is so important. We may not have realized there was something odd going on with this county if we didn’t see it. (We’ll make a note to investigate further what is going on with this county later.)
Because there are too many overlapping lines, we can’t really see much here, so a better plot might focus on a subset of counties. For instance, we can look at the five largest counties in the state by population and make the same plot.
big_counties <- c("Maricopa County", "Pima County", "Pinal County",
"Yavapai County", "Yuma County")
az_jail_county_tidy |>
filter(
dem_group == "total",
county_name %in% big_counties
) |>
ggplot(aes(year, rate, color = county_name)) +
geom_line()
Okay! We can start to see something now. First, jail incarceration rates have gone up in all counties since the 1980s, the peak was sometime around 2010, rates have fallen since then. Looks like Yavapai County had the highest rate and Pinal County had the lowest rate in the last year of data we have available.
One easy way to improve this plot is by adding a title, subtitle, and axis labels that tells us what we’re looking at
az_jail_county_tidy |>
filter(
dem_group == "total",
county_name %in% big_counties
) |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Jail incarceration rate",
color = NULL
)
Note that I’ve excluded x-axis and legend titles by using
NULL in labs() because they are
self-explanatory.
This is better, but we still have a couple things we could improve.
First it is a little tricky to know which line is which county because
the legend is not in the same order as the lines in the chart.
Additionally, the y-axis labels are hard to interpret. To fix these
these things we’ll convert county_name to a factor that is
ordered by rate and change the y-axis labels to rate per 100,000
people.
library(scales)
az_jail_county_tidy |>
filter(
dem_group == "total",
county_name %in% big_counties
) |>
mutate(county_name = fct_reorder(county_name, rate, last, .desc = TRUE)) |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
)
This is improving, but we can do better! One technique that is very useful is direct labeling, or adding text labels to the plot itself. In this case, we could add the county name to each line and remove the legend entirely! When using this technique, I find it easiest to assign the cleaned up and filtered data set I want to plot to a new data frame before I plot.
One good way to add text labels to a plot is with
geom_text(). In the same way that we use
geom_line() for a line plot or geom_point()
for a scatterplot, geom_text() will plot text at specified
points on a grid. For geom_line() to work, we need map text
from a column in our input data to the label aesthetic.
big_county_to_plot <- az_jail_county_tidy |>
filter(
dem_group == "total",
county_name %in% big_counties
) |>
mutate(county_name = fct_reorder(county_name, rate, last, .desc = TRUE))
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
aes(label = county_name),
size = 3,
hjust = 0,
nudge_x = 0.25
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
)
Errrrrr. That doesn’t look great. For our direct label plan to work,
we don’t want one text label per point, we just want one per line. And I
think it will look best if we put the label at the far right of each
line. To do this, we can tell geom_text() to use a filtered
version of the main data set that only includes the most recent year of
data.
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
data = filter(big_county_to_plot, year == 2018),
aes(label = county_name),
hjust = 0
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
)
Alright, we’re getting somewhere. But we can’t really read the labels
and we still have the legend that we wanted to get rid of. To be able to
read the labels we’ve added to the plot we need to extend the area of
plot that is visible to us. We do that by turning “clipping” off in our
coordinate system and by adding a margin to whole plot. By default,
ggplot2 will cut off anything outside the plotting area, but by setting
clip = "off" we’ll be able to see the entire county name we
are plotting. Additionally, we’ll turn the legend off and increase the
right side margin of the plot to 75 pixels.
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
data = filter(big_county_to_plot, year == 2018),
aes(label = county_name),
hjust = 0
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
) +
coord_cartesian(clip = "off") +
theme(
legend.position = "none",
plot.margin = margin(5, 75, 5, 5)
)
It’s not perfect, but I like this a lot better than where we started!
Now that we’ve added our direct labels, things look a little weird
with the gray background of the plot. We can control the plot
background, and a great many other things such as font face and size,
position of labels, grid lines, etc. by modifying the ggplot theme. We
already started working on the theme above when we got rid of the legend
and increased the margin around the plot. There are huge number of arguments to
theme() that you can explore and play with. And there
is also a way to set a new theme for the entire plot. There are a
handful of themes built into ggplot2 that change many aspects of a
theme. Here for instance is theme_minimal():
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
data = filter(big_county_to_plot, year == 2018),
aes(label = county_name),
hjust = 0
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
) +
coord_cartesian(clip = "off") +
theme_minimal() +
theme(
legend.position = "none",
plot.margin = margin(5, 75, 5, 5)
)
And here is theme_classic():
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
data = filter(big_county_to_plot, year == 2018),
aes(label = county_name),
hjust = 0
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
) +
coord_cartesian(clip = "off") +
theme_classic() +
theme(
legend.position = "none",
plot.margin = margin(5, 75, 5, 5)
)
I tend to use theme_minimal() as a starting place and
build from there. In this case, I think I’d like to get rid of the
x-axis grid lines and make the y-axis title and captions a little
smaller.
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
data = filter(big_county_to_plot, year == 2018),
aes(label = county_name),
hjust = 0
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
) +
coord_cartesian(clip = "off") +
theme_minimal() +
theme(
legend.position = "none",
plot.margin = margin(5, 75, 5, 5),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_text(size = 10),
plot.caption = element_text(size = 8)
)
While I’m working on the look of this, I think we might be able to
use some better colors for our counties. If I don’t have a set color
palette for a particular project I am working on, my first stop is Color Brewer, which has a number of
very good palettes. Importantly, you can choose between sequential,
diverging, and qualitative (or categorical) palettes depending on what
kind of data you are plotting. In this case, we have qualitative data
because there is no inherent ordering to the county names. The brewer
palettes are built into {ggplot2} and we can access them using
scale_color_brewer().
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line() +
geom_text(
data = filter(big_county_to_plot, year == 2018),
aes(label = county_name),
hjust = 0
) +
scale_y_continuous(labels = label_comma(scale = 1e5)) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Jail incarceration rate by county",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = "Rate per 100k residents",
color = NULL
) +
coord_cartesian(clip = "off") +
theme_minimal() +
theme(
legend.position = "none",
plot.margin = margin(5, 75, 5, 5),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_text(size = 10),
plot.caption = element_text(size = 8)
)
Alright! I’m almost very happy with this plot, and it certainly is a big improvement over where we started:
I’m going to make a few more little tweaks and get this to the point where I’d feel good about putting this chart in slide deck or report. See if you can spot the additional changes I’ve made.
big_county_to_plot <- az_jail_county_tidy |>
filter(
dem_group == "total",
county_name %in% big_counties
) |>
mutate(
county_name = fct_reorder(county_name, rate, last, .desc = TRUE),
text_label = paste(county_name, "–", comma(rate, 1, scale = 1e5))
)
big_county_to_plot |>
ggplot(aes(year, rate, color = county_name)) +
geom_line(size = 0.75) +
geom_point(
data = filter(big_county_to_plot, year == max(year)),
size = 1.25
) +
geom_text(
data = filter(big_county_to_plot, year == max(year)),
aes(label = text_label),
size = 3.5,
family = "Franklin Gothic Medium",
hjust = 0,
nudge_x = 0.5
) +
scale_x_continuous(
breaks = c(1970, 1980, 1990, 2000, 2010, 2018)
) +
scale_y_continuous(
labels = label_comma(scale = 1e5),
expand = expansion(mult = c(0, 0.05))
) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Jail populations grew from the 1970s to a peak near 2010",
subtitle = "Incarceration rate per 100k residents, select Arizona counties",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = NULL,
color = NULL
) +
coord_cartesian(clip = "off") +
expand_limits(y = 0) +
theme_minimal(base_family = "Franklin Gothic Book") +
theme(
legend.position = "none",
plot.margin = margin(5, 85, 5, 5),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
plot.caption = element_text(size = 8),
plot.title = element_text(family = "Franklin Gothic Medium"),
axis.ticks.x = element_line(color = "gray80"),
axis.ticks.length = unit(1.5, "mm"),
axis.text = element_text(size = 10)
)
One additional neat feature of {ggplot2} is facets. These plots are sometimes called small multiples and can be used to show multiple versions of the same plot with different data. It can be a useful technique when you have too much data to include in a single plot.
In this case, we could revisit our first plot where there were too many counties to include all of them. Instead, we could create a faceted plot where we make one plot per county.
To do this, use facet_wrap() and specify which column in
your data you want use to create the facets
az_jail_county_tidy |>
filter(dem_group == "total") |>
ggplot(aes(year, rate)) +
geom_line() +
facet_wrap(vars(county_name))
Now it’s easy to figure out the county with the big jump in 1999 (La Paz). By default, ggplot will use the same x and y scales for each facet. Sometimes this is useful if you want to compare the total in each facet with the other facets. Other times, you may be more interested in the relative change or trend across the facets. In this case, you can set the scales to be “free” so that the limits of each facet is set individually. Here, for example we can make the y-axis have free scales – notice the y-axis scales for each facet are now different.
az_jail_county_tidy |>
filter(dem_group == "total") |>
ggplot(aes(year, rate)) +
geom_line() +
facet_wrap(vars(county_name), scales = "free_y")
We can also do all our other fancy theming, labeling, and other changes from above, so that we might end up with a final plot like this:
az_jail_county_tidy |>
filter(dem_group == "total", county_name != "La Paz County") |>
mutate(county_name = str_remove(county_name, " County")) |>
ggplot(aes(year, rate)) +
geom_line(color = "steelblue") +
scale_x_continuous(
breaks = c(1970, 1995, 2018)
) +
scale_y_continuous(
labels = label_comma(scale = 1e5),
expand = expansion(mult = c(0, 0.05))
) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Jail populations grew from the 1970s to a peak near 2010",
subtitle = "Incarceration rate per 100k residents, Arizona counties",
caption = "Vera Institute Incarceration Trends",
x = NULL,
y = NULL,
color = NULL
) +
coord_cartesian(clip = "off") +
expand_limits(y = 0) +
facet_wrap(vars(county_name)) +
theme_minimal(base_family = "Franklin Gothic Book") +
theme(
legend.position = "none",
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
plot.caption = element_text(size = 8),
plot.title = element_text(family = "Franklin Gothic Medium"),
axis.ticks.x = element_line(color = "gray80"),
axis.text = element_text(size = 7)
)
With this kind of plot we get some different insights and begin to understand that the steep increase in jail incarceration rates was not evenly distributed across all Arizona counties.